Authors: Michał Woźniak (385190) & Michał Wrzesiński (385197)
The e-commerce market has been growing rapidly over the past few years, and the Covid-19 pandemic has further amplified this trend by irreversibly changing consumer habits. Currently, one of the fastest-growing e-commerce markets in Europe is Poland. Companies such as Allegro, OLX, Amazon and AliExpress process several million baskets a day on average. PwC estimated that in 2026 the gross value of the Polish e-commerce market will reach 162 billion PLN.
Handling such a volume of parcels requires well-functioning logistics companies that can deliver an ordered parcel shortly after the order is placed. Several international logistics companies operate in Poland, e.g. DHL and DPD; Poczta Polska is an important local player; however, InPost enjoys the greatest recognition among customers. Its success has two sources: firstly, InPost has developed a network of parcel machines (Paczkomat) that have revolutionized the way we receive parcels; secondly, InPost has operationalized its logistics so well that it is able to deliver parcels in a very short time.
One of the obvious elements of InPost's successful logistics is an appropriate deployment strategy for parcel pick-up points, as it determines how convenient the InPost service is for the end user. We suppose that the models used to choose a spatial point for a Paczkomat are not ideal, because they do not naturally take into account constraints such as the inability to lease a given location. Thus, the departments responsible for expanding the network of pick-up points may often place a parcel machine intuitively (for instance, close to a shopping mall), without a statistical basis (they probably have their own targets to meet in order to receive bonuses, and do not evaluate their choices in the short or medium term). In addition, more and more competitors of InPost have recently appeared (adopting the same parcel-machine delivery model), such as Ruch or even Poczta Polska, and it seems logical that they may be copying InPost's deployment choices. It is therefore very interesting to analyze how the deployment of InPost's pick-up points compares with the competition. It is also worth verifying whether the deployment determinants of InPost's Paczkomats are consistent with the literature. We refer here to Morganti et al. (2014), who suggest the following indicators of pick-up point deployment: demographic indicators (population density, employment rate, computer ownership, Internet access); centers and nodes for city users ("parameters related to end-consumers' mobility and accessibility to socio-economic activities, in particular end-consumers' use of both public transport and private cars, and the density of retail outlets and commercial services, business and employment sites, cultural and leisure centers and public transportation nodes"); and parcel flows within the network (mostly related to the transport system and user preferences).
The purpose of this project is to examine the deployment strategy of InPost parcel pick-up points compared to the competition in a multivariate environment. Furthermore, we want to test whether the deployment determinants (herein referred to as control variables) indicated by the literature and by our modeling intuition are relevant to InPost's strategy.
We state the following research questions:
To answer the above research questions, we conducted a full econometric analysis based primarily on spatial statistics and spatial models. To reduce the computational complexity of the task, we focused on only two cities in Poland: Warsaw and Cracow. These are two key cities from InPost's point of view (Warsaw is the capital of Poland; Cracow is where InPost was established). Depending on the modeling approach, we used different levels of data aggregation: a 1km x 1km grid for spatial dependence models, and point data for spatial drift models. We test both approaches because, in this problem, both spatial autocorrelation and spatial drift seem to be intuitive phenomena.
We present the following table of contents for this research:
This part is devoted to the data collection process. As output, we obtain the raw data that will be transformed into the final dataset in the 3. Dataset construction section.
Generally, we divided the data into 4 categories:
Pick-up point data come from websites such as Bliskapaczka.pl and DHL.
Spatial shape data come from GUGIK (the Head Office of Geodesy and Cartography in Poland).
Demographic data are taken from the Inspire repository and represent indicators for 1km2 grids in Poland.
Finally, we obtained points-of-interest data from the OSM repository, a site which stores snapshots of OSM as shapefiles.
from google_drive_downloader import GoogleDriveDownloader
import requests
import json
import numpy as np
import pandas as pd
%config Completer.use_jedi = False
We define some utilities for code reproducibility.
def download_gd_data_from_dict(dictionary: dict):
    '''
    Download, save and unzip data from Google Drive
    '''
    for i, j in dictionary.items():
        GoogleDriveDownloader.download_file_from_google_drive(file_id=j,
                                                              dest_path=f"../datasets/raw_data/{i}/{i}.zip",
                                                              unzip=True,
                                                              showsize=True,
                                                              overwrite=False)
def download_json_data_from_url(name: str, url: str):
    '''
    Download and save data from JSON API outputs
    '''
    response = requests.get(url).text
    df = pd.DataFrame(json.loads(response))
    df.to_csv(f"../datasets/raw_data/{name}.csv")
We decided to download the data from GUGIK, OSM and Inspire (using the links attached in the introduction to this stage of the study) and store it on our academic Google Drive to ensure reproducibility at any time. Thanks to our utility functions, we can simply pass the direct link to a file and then download, store and unzip the data. The data are not stored in our remote git repository due to their size, but thanks to Google Drive they are available to anyone.
You can also inspect a file in the browser by combining https://drive.google.com/file/d/ with the file_id, for instance: https://drive.google.com/file/d/1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd
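The URL pattern above can be wrapped in a tiny helper for convenience (a sketch; the function name is ours, not part of the pipeline below):

```python
# Build a browser-viewable Google Drive URL from a file id,
# following the base-URL + file_id pattern described above.
GDRIVE_BASE = "https://drive.google.com/file/d/"

def gdrive_view_url(file_id: str) -> str:
    return GDRIVE_BASE + file_id

# Example with the voivodeship shapes file id:
url = gdrive_view_url("1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd")
```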
gugik = {'guigk_voi':'1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd',
'guigk_pov':'1wX99dmNUbiEKYKh-qAfxDipT9oC6DLzE',
'guigk_com':'1URjb9NM6Fm_qES5kC4QPPXZGERzarUIa'}
download_gd_data_from_dict(gugik)
osm = {'osm_mazowieckie':'195E_n9JlgavFWp4mbaOCHAYKFWziBkc0',
'osm_malopolskie':'1KG6uPhCZ-jKDgEpBU46WKHXVG_Mc-dBS'}
download_gd_data_from_dict(osm)
inspire = {"inspire":"1avnBMziIn9uLetSbucMrZlZadhnSUvPE"}
download_gd_data_from_dict(inspire)
We collect the pick-up point data from two websites, or, more precisely, from their APIs. This is the quickest way to gather these data, taking only seconds.
url = 'https://pos.bliskapaczka.pl/api/v1/pos?fields=operator%2Ccode%2Clatitude%2Clongitude%2Cbrand%2CbrandPretty%2CoperatorPretty%2Ccod%2Cavailable%2C+city%2C+street&operators=RUCH%2CINPOST%2CPOCZTA%2CDPD%2CUPS%2CFEDEX'
download_json_data_from_url("bliska_paczka", url)
url = 'https://parcelshop.dhl.pl/index.php/points?type=lm&country=pl&ptype=parcelShop&hours_from=10&hours_to=16&week_days_PON=T&week_days_WT=T&week_days_SR=T&week_days_CZW=T&week_days_PT=T&week_days_SOB=N&week_days_NIEDZ=N&options_pickup_cod=N&show_on_map_parcelshop=T&show_on_map_parcelstation=T&show_on_map_pok=T&tab=pickup'
download_json_data_from_url("dhl", url)
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import geometry
from pyproj import CRS
import geog
import shapely
from shapely.geometry import Point
import shapely.wkt
import matplotlib.pyplot as plt
import gc
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%config Completer.use_jedi = False
import warnings
warnings.filterwarnings("ignore")
In this part of the study, we merge all the data into complete datasets. It is worth mentioning that we collect data for 2 cities, Warszawa and Kraków, both for 1km2 grids and for 500m buffers around InPost points (the latter only for GWR). When merging for grids, we assumed that it was enough for a grid to intersect the poviat boundaries. Additionally, we count the other data within each grid. In turn, when merging for buffers, we also count the data within each buffer and additionally use the Inspire data (if more than one grid intersects a buffer, we take the average value of those grids).
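The buffer-level aggregation rule described above (if more than one grid intersects a buffer, take the average of their values) can be sketched on toy data; the frame and column names below are illustrative, not part of the real dataset:

```python
import pandas as pd

# Toy result of a spatial join: one row per (buffer, intersecting grid) pair.
pairs = pd.DataFrame({
    "buffer_index": [0, 0, 1],               # buffer 0 intersects two grids, buffer 1 one grid
    "TOT":          [1200.0, 800.0, 500.0],  # total population of each intersecting grid
})

# If more than one grid intersects a buffer, take the average of those grids.
buffer_vals = pairs.groupby("buffer_index")["TOT"].mean()
# buffer 0 -> mean(1200, 800) = 1000.0; buffer 1 -> 500.0
```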
In this section, we merge the files for parcel points. Finally, we get a file that contains geolocation data for all available pickup points.
bliska_paczka = pd.read_csv("../datasets/raw_data/bliska_paczka.csv", index_col=0)
bliska_paczka = bliska_paczka.loc[bliska_paczka.available == True, ("brand", "operator", "city", "street", "longitude", "latitude")]
bliska_paczka.brand = bliska_paczka.brand.str.lower()
bliska_paczka.operator = bliska_paczka.operator.str.lower()
bliska_paczka.city = bliska_paczka.city.str.lower()
bliska_paczka.street = bliska_paczka.street.str.lower()
bliska_paczka = bliska_paczka.loc[:, ("brand", "operator", "longitude", "latitude")]
dhl = pd.read_csv("../datasets/raw_data/dhl.csv", index_col=0)
dhl.P_TYPE = dhl.P_TYPE.str.lower()
dhl = dhl.drop(columns=["ID"])
dhl.columns = ["brand", "latitude", "longitude"]
dhl["operator"] = "dhl"
dhl = dhl[["brand", "operator", "longitude", "latitude"]]
df = pd.concat([bliska_paczka, dhl], axis=0)
df = df.drop(columns=["brand"])
df = gpd.GeoDataFrame(df, geometry=[geometry.Point(xy) for xy in zip(df.longitude, df.latitude)], crs=CRS("epsg:4258"))
df = df.drop(columns=["latitude", "longitude"])
df.to_csv('../datasets/preprocessed_data/pickup_points_by_operator.csv', index = False)
In the next section, we load the data from GUGIK and Inspire. The GUGIK data is necessary to obtain polygons for the 2 poviats (Warszawa and Kraków). In turn, the Inspire data provides us with demographic variables describing the population of each 1km2 grid, both in total and broken down by gender and age. We merge both of the above datasets by location, i.e. grids within or intersecting the poviats remain in our dataset. Similarly, we add the data for parcel pick-up points according to their presence within the poviats.
pov = gpd.read_file("../datasets/raw_data/guigk_pov/Powiaty.shx", encoding='utf-8')
pov = pov.loc[pov.JPT_NAZWA_.isin(["powiat Warszawa", "powiat Kraków"])==True, ("JPT_NAZWA_", "geometry")]
pov = pov.to_crs("epsg:4258")
grids = gpd.read_file("../datasets/raw_data/inspire/PD_STAT_GRID_CELL_2011.shp", encoding='utf-8')
grids = grids[['TOT', 'TOT_0_14', 'TOT_15_64', 'TOT_65__', 'TOT_MALE', 'TOT_FEM',
'MALE_0_14', 'MALE_15_64', 'MALE_65__', 'FEM_0_14', 'FEM_15_64',
'FEM_65__', 'FEM_RATIO', 'geometry']]
grids = grids.to_crs("epsg:4258")
pov_grids = gpd.sjoin(grids, pov, how="inner", op="intersects")
pov_grids = pov_grids.drop(columns=["index_right"])
pov_grids = pov_grids.reset_index()
pov_grids = pov_grids.rename(columns={"index":"grid_index"})
pov_grids_sliced = pov_grids[["grid_index", "geometry"]].copy()
df_pov_grids_sliced = gpd.sjoin(df, pov_grids_sliced, how="right", op="within")
df_pov_grids_sliced = df_pov_grids_sliced.drop(columns=["index_left"])
df_pov_grids_sliced = df_pov_grids_sliced.groupby(["operator", "grid_index"], as_index=False).count()
df_pov_grids_sliced = df_pov_grids_sliced.pivot(index='grid_index', columns='operator').fillna(0).reset_index()
df_pov_grids_sliced.columns = df_pov_grids_sliced.columns.droplevel()
df_pov_grids_sliced.columns = ['grid_index', 'dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch', 'ups']
df_pov_grids = pd.merge(pov_grids, df_pov_grids_sliced, how="left", on="grid_index")
df_pov_grids.columns = df_pov_grids.columns.str.lower()
df_pov_grids = df_pov_grids.fillna(0)
df_pov_grids = df_pov_grids[['grid_index','geometry', 'jpt_nazwa_', 'dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch',
'ups', 'tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male',
'tot_fem', 'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14',
'fem_15_64', 'fem_65__', 'fem_ratio']]
del bliska_paczka, dhl, pov_grids, pov_grids_sliced, df_pov_grids_sliced; gc.collect();
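The count-and-pivot step applied above, which turns individual pick-up points into per-grid operator counts, can be illustrated on toy data (the frame below is illustrative, not drawn from the real dataset):

```python
import pandas as pd

# Toy join result: one row per pick-up point, with the grid it falls into.
points = pd.DataFrame({
    "operator":   ["inpost", "inpost", "dpd", "inpost"],
    "grid_index": [0, 0, 0, 1],
})

# Count points per (grid, operator), then pivot operators into columns;
# grids without a given operator get a count of 0.
counts = (points.groupby(["grid_index", "operator"]).size()
                .unstack(fill_value=0))
# grid 0: inpost=2, dpd=1; grid 1: inpost=1, dpd=0
```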
Finally, in order to merge the OSM (points of interest) variables into the final dataset, we created a function that loads OSM data for the voivodeships. We then count the points of interest within each grid and thus obtain the final dataset for grids.
def merging_vars(data, poviat, path):
    # df for concrete poviat
    df_output = data[data.jpt_nazwa_ == poviat]
    # loading points of interest
    buildings = gpd.read_file(path + 'gis_osm_buildings_a_free_1.shp')
    landuse = gpd.read_file(path + 'gis_osm_landuse_a_free_1.shp')
    pois_a = gpd.read_file(path + 'gis_osm_pois_a_free_1.shp')
    railways = gpd.read_file(path + 'gis_osm_railways_free_1.shp')
    roads = gpd.read_file(path + 'gis_osm_roads_free_1.shp')
    traffic_a = gpd.read_file(path + 'gis_osm_traffic_a_free_1.shp')
    traffic = gpd.read_file(path + 'gis_osm_traffic_free_1.shp')
    transport_a = gpd.read_file(path + 'gis_osm_transport_a_free_1.shp')
    # concrete points from the above dfs
    buildings_points = buildings[buildings['type'].isin(['house', 'residential', 'bungalow', 'apartment'])][['osm_id', 'geometry']]
    shop_points = buildings[buildings['type'].isin(['supermarket', 'bakery', 'kiosk', 'mall', 'department_store', 'convenience', 'clothes', 'florist', 'chemist'])][['osm_id', 'geometry']]
    parks_points = landuse[landuse['fclass'] == 'park'][['osm_id', 'geometry']]
    forest_points = landuse[landuse['fclass'] == 'forest'][['osm_id', 'geometry']]
    schools_points = pois_a[pois_a['fclass'].isin(['school', 'playground'])][['osm_id', 'geometry']]
    railways_points = railways[['osm_id', 'geometry']]
    cycleways_points = roads[roads['fclass'] == 'cycleway'][['osm_id', 'geometry']]
    parking_points = traffic_a[traffic_a['fclass'] == 'parking'][['osm_id', 'geometry']]
    crossing_points = traffic[traffic['fclass'] == 'crossing'][['osm_id', 'geometry']]
    bus_stop_points = transport_a[transport_a['fclass'] == 'bus_stop'][['osm_id', 'geometry']]
    # drop unnecessary dfs
    del buildings, landuse, pois_a, railways, roads, traffic_a, traffic, transport_a; gc.collect()
    # changing crs
    buildings_points = buildings_points.to_crs("epsg:4258")
    shop_points = shop_points.to_crs("epsg:4258")
    parks_points = parks_points.to_crs("epsg:4258")
    forest_points = forest_points.to_crs("epsg:4258")
    schools_points = schools_points.to_crs("epsg:4258")
    railways_points = railways_points.to_crs("epsg:4258")
    cycleways_points = cycleways_points.to_crs("epsg:4258")
    parking_points = parking_points.to_crs("epsg:4258")
    crossing_points = crossing_points.to_crs("epsg:4258")
    bus_stop_points = bus_stop_points.to_crs("epsg:4258")
    # list of dataframes
    list_of_dfs = [buildings_points, shop_points, parks_points, forest_points, schools_points, railways_points,
                   cycleways_points, parking_points, crossing_points, bus_stop_points]
    # names of new columns
    names = ['buildings', 'shops', 'parks', 'forests', 'schools', 'railways',
             'cycleways', 'parkings', 'crossings', 'bus_stops']
    # count points per grid in a loop
    for i in range(len(list_of_dfs)):
        actual_point = gpd.sjoin(list_of_dfs[i], df_output, how="inner", op="intersects")
        x = actual_point[['osm_id', 'grid_index']].groupby(['grid_index']).count()
        x.rename(columns={"osm_id": names[i]}, inplace=True)
        x.reset_index(inplace=True)
        df_output = df_output.merge(x, on='grid_index', how='outer')
    df_output.fillna(0, inplace=True)
    df_output.drop(columns={'jpt_nazwa_'}, inplace=True)
    return df_output
df_krakow = merging_vars(df_pov_grids, 'powiat Kraków', '../datasets/raw_data/osm_malopolskie/')
df_krakow[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
'fem_65__']] = df_krakow[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
'fem_65__']].astype(float)
df_krakow.to_csv('../datasets/preprocessed_data/df_krakow.csv', index = False)
df_krakow.to_file('../datasets/preprocessed_data/df_krakow.shp')
del df_krakow; gc.collect()
df_warszawa = merging_vars(df_pov_grids, 'powiat Warszawa', '../datasets/raw_data/osm_mazowieckie/')
df_warszawa[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
'fem_65__']] = df_warszawa[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
'fem_65__']].astype(float)
df_warszawa.to_csv('../datasets/preprocessed_data/df_warszawa.csv', index = False)
df_warszawa.to_file('../datasets/preprocessed_data/df_warszawa.shp')
del df_warszawa; gc.collect()
When constructing the dataset for GWR, we followed the same steps as for the above dataset. The only difference is that we collected data for 500m buffers.
df_geo = df.copy()
df_geo['point_id'] = list(range(0, df_geo.shape[0]))
pov_df_geo = gpd.sjoin(pov, df_geo, how="inner", op="intersects")
pov_df_geo = pov_df_geo.merge(df_geo, on = 'point_id')
pov_df_geo = pov_df_geo.drop(columns = {'geometry_x', 'index_right', 'operator_y'})
pov_df_geo = pov_df_geo.rename(columns={"geometry_y": "geometry", "JPT_NAZWA_": "jpt_nazwa_", "operator_x": "operator"})
inpost_points = pov_df_geo[pov_df_geo['operator'] == 'inpost']
# creating buffer (500m radius)
def buffer(point):
    n_points = 50
    angles = np.linspace(0, 360, n_points)
    radius = 500
    polygon = geog.propagate(point, angles, radius)
    x = polygon.tolist()
    lon = list(list(zip(*x))[0])
    lat = list(list(zip(*x))[1])
    pts = gpd.GeoSeries([Point(x, y) for x, y in zip(lon, lat)])
    poly = geometry.Polygon([[p.x, p.y] for p in pts])
    polyg = gpd.GeoSeries(poly)
    return polyg
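As a rough sanity check of the buffer construction: a 50-vertex circle of radius 500 should enclose an area close to pi*r^2. The sketch below uses a planar approximation with the shoelace formula, unlike the geodesic geog.propagate call used in the function, so it only illustrates the geometry, not the exact coordinates:

```python
import numpy as np

n_points, radius = 50, 500.0
angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
xs, ys = radius * np.cos(angles), radius * np.sin(angles)

# Shoelace formula for the area of the resulting polygon.
area = 0.5 * abs(np.dot(xs, np.roll(ys, -1)) - np.dot(ys, np.roll(xs, -1)))
# For 50 vertices the polygon area is within about 1% of pi * r**2.
```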
inpost_points['buffer'] = inpost_points.apply(lambda x: buffer(x['geometry']), axis=1)
inpost_points = inpost_points.rename(columns={"geometry": "center"})
inpost_buffers = gpd.GeoDataFrame(inpost_points, geometry = 'buffer', crs = "epsg:4258")
inpost_buffers = inpost_buffers.reset_index()
inpost_buffers = inpost_buffers.rename(columns={"index":"buffer_index"})
def merge_points(data_inpost, data_operator, operator):
    new_df = data_operator[data_operator['operator'] == operator]
    new_df = gpd.GeoDataFrame(new_df, geometry='geometry', crs="epsg:4258")
    actual_point = gpd.sjoin(new_df, data_inpost, how="inner", op="intersects")
    x = actual_point[['operator_left', 'buffer_index']].groupby(['buffer_index']).count()
    x.rename(columns={"operator_left": operator + '_points'}, inplace=True)
    x.reset_index(inplace=True)
    df_output = data_inpost.merge(x, on='buffer_index', how='outer')
    df_output.fillna(0, inplace=True)
    return df_output
operators = ['inpost', 'poczta', 'dhl', 'ruch', 'dpd', 'ups', 'fedex']
for operator in operators:
    inpost_buffers = merge_points(inpost_buffers, pov_df_geo, operator)
buff = gpd.sjoin(inpost_buffers, grids, how="inner", op="intersects")
x = buff[['buffer_index', 'TOT', 'TOT_0_14', 'TOT_15_64', 'TOT_65__', 'TOT_MALE', 'TOT_FEM',
'MALE_0_14', 'MALE_15_64', 'MALE_65__', 'FEM_0_14', 'FEM_15_64',
'FEM_65__', 'FEM_RATIO']].groupby(['buffer_index']).mean()
inpost_buffers = inpost_buffers.merge(x, on='buffer_index')
def merge_osm(data, poviat, path):
    # df for concrete poviat
    df_output = data[data.jpt_nazwa_ == poviat]
    # loading points of interest
    buildings = gpd.read_file(path + 'gis_osm_buildings_a_free_1.shp')
    landuse = gpd.read_file(path + 'gis_osm_landuse_a_free_1.shp')
    pois_a = gpd.read_file(path + 'gis_osm_pois_a_free_1.shp')
    railways = gpd.read_file(path + 'gis_osm_railways_free_1.shp')
    roads = gpd.read_file(path + 'gis_osm_roads_free_1.shp')
    traffic_a = gpd.read_file(path + 'gis_osm_traffic_a_free_1.shp')
    traffic = gpd.read_file(path + 'gis_osm_traffic_free_1.shp')
    transport_a = gpd.read_file(path + 'gis_osm_transport_a_free_1.shp')
    # concrete points from the above dfs
    buildings_points = buildings[buildings['type'].isin(['house', 'residential', 'bungalow', 'apartment'])][['osm_id', 'geometry']]
    shop_points = buildings[buildings['type'].isin(['supermarket', 'bakery', 'kiosk', 'mall', 'department_store', 'convenience', 'clothes', 'florist', 'chemist'])][['osm_id', 'geometry']]
    parks_points = landuse[landuse['fclass'] == 'park'][['osm_id', 'geometry']]
    forest_points = landuse[landuse['fclass'] == 'forest'][['osm_id', 'geometry']]
    schools_points = pois_a[pois_a['fclass'].isin(['school', 'playground'])][['osm_id', 'geometry']]
    railways_points = railways[['osm_id', 'geometry']]
    cycleways_points = roads[roads['fclass'] == 'cycleway'][['osm_id', 'geometry']]
    parking_points = traffic_a[traffic_a['fclass'] == 'parking'][['osm_id', 'geometry']]
    crossing_points = traffic[traffic['fclass'] == 'crossing'][['osm_id', 'geometry']]
    bus_stop_points = transport_a[transport_a['fclass'] == 'bus_stop'][['osm_id', 'geometry']]
    # drop unnecessary dfs
    del buildings, landuse, pois_a, railways, roads, traffic_a, traffic, transport_a; gc.collect()
    # changing crs
    buildings_points = buildings_points.to_crs("epsg:4258")
    shop_points = shop_points.to_crs("epsg:4258")
    parks_points = parks_points.to_crs("epsg:4258")
    forest_points = forest_points.to_crs("epsg:4258")
    schools_points = schools_points.to_crs("epsg:4258")
    railways_points = railways_points.to_crs("epsg:4258")
    cycleways_points = cycleways_points.to_crs("epsg:4258")
    parking_points = parking_points.to_crs("epsg:4258")
    crossing_points = crossing_points.to_crs("epsg:4258")
    bus_stop_points = bus_stop_points.to_crs("epsg:4258")
    # list of dataframes
    list_of_dfs = [buildings_points, shop_points, parks_points, forest_points, schools_points, railways_points,
                   cycleways_points, parking_points, crossing_points, bus_stop_points]
    # names of new columns
    names = ['buildings', 'shops', 'parks', 'forests', 'schools', 'railways',
             'cycleways', 'parkings', 'crossings', 'bus_stops']
    # count points per buffer in a loop
    for i in range(len(list_of_dfs)):
        actual_point = gpd.sjoin(list_of_dfs[i], df_output, how="inner", op="intersects")
        x = actual_point[['osm_id', 'buffer_index']].groupby(['buffer_index']).count()
        x.rename(columns={"osm_id": names[i]}, inplace=True)
        x.reset_index(inplace=True)
        df_output = df_output.merge(x, on='buffer_index', how='outer')
    df_output.fillna(0, inplace=True)
    df_output.drop(columns={'jpt_nazwa_'}, inplace=True)
    return df_output
df_krakow_inpost = merge_osm(inpost_buffers, 'powiat Kraków', '../datasets/raw_data/osm_malopolskie/')
df_krakow_inpost.to_csv('../datasets/preprocessed_data/df_krakow_gwr.csv', index = False)
del df_krakow_inpost; gc.collect();
df_warszawa_inpost = merge_osm(inpost_buffers, 'powiat Warszawa', '../datasets/raw_data/osm_mazowieckie/')
df_warszawa_inpost.to_csv('../datasets/preprocessed_data/df_warszawa_gwr.csv', index = False)
del df_warszawa_inpost; gc.collect();
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
%matplotlib inline
# geo libraries
import shapely
from shapely.geometry import Point, Polygon
import shapely.wkt
import geopandas as gpd
import contextily as ctx
# colors
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
from mpl_toolkits.axes_grid1 import make_axes_locatable
import matplotlib.colors as mcolors
In this part, we present visualizations for the main part of our research, i.e. the grid data. In the first step, we perform a data transformation. In the second step, we present maps showing interesting relationships for each of the analyzed cities. We focus primarily on comparing InPost points with the competition, as well as on showing interesting variables, both demographic ones and those describing points of interest.
Taking into account the transformations, we obtain the following variables for visualization:
Additionally, we create our own colour palettes with the make_colormap function.
We present the visualizations variable by variable, i.e. each feature is visualized for both cities.
df_warszawa = pd.read_csv("../datasets/preprocessed_data/df_warszawa.csv")
df_krakow = pd.read_csv("../datasets/preprocessed_data/df_krakow.csv")
df_warszawa['city'] = 'Warszawa'
df_krakow['city'] = 'Krakow'
df_all = pd.concat([df_warszawa, df_krakow])
def transform_df(df):
    # geo changes
    df['geometry'] = df.apply(lambda x: shapely.wkt.loads(x['geometry']), axis=1)
    df = gpd.GeoDataFrame(df, geometry='geometry', crs="epsg:4258")
    # new variables
    df['operators'] = df.apply(lambda x: [x['inpost'], x['poczta'], x['dhl'], x['dpd'], x['ruch'], x['ups'], x['fedex']], axis=1)
    df['leader'] = df.apply(lambda x: np.max(x['operators']) != 0, axis=1)
    df['pos'] = df.apply(lambda x: np.where((x['leader'] == 1), np.argmax(x['operators']), 100), axis=1)
    # dictionaries for applying
    dict_pos = {0: 'inpost', 1: 'poczta', 2: 'dhl', 3: 'other', 4: 'other', 5: 'other', 6: 'other', 100: 'no_leader'}
    dict_pos2 = {0: 'inpost', 1: 'poczta', 2: 'dhl', 3: 'dpd', 4: 'ruch', 5: 'ups', 6: 'fedex', 100: 'no_leader'}
    # new variables
    df['leader'] = df['pos'].replace(dict_pos)
    df['no_leader'] = 0
    df['leader_number'] = df.apply(lambda x: x[dict_pos2[float(x['pos'])]], axis=1)
    df['other'] = df['dhl'] + df['dpd'] + df['fedex'] + df['poczta'] + df['ruch'] + df['ups']
    df['dominance_inpost'] = df['inpost'] > df['other']
    df['dominance_inpost2'] = df['inpost'] > 1/2*df['other']
    # changing crs (essential for the contextily library to show the visualisation on a map)
    df = df.to_crs(epsg=3857)
    return df
df_warszawa = transform_df(df_warszawa)
df_krakow = transform_df(df_krakow)
df_all = transform_df(df_all)
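To illustrate the leader logic encoded in transform_df on a single toy row: the leader is the operator with the most points in a grid (via np.argmax, so in the event of a tie the operator listed first wins), and a grid with no points at all gets no_leader. The standalone sketch below mirrors, but is not, the code above:

```python
import numpy as np

operators = ["inpost", "poczta", "dhl", "dpd", "ruch", "ups", "fedex"]

def grid_leader(counts):
    # counts: per-operator point counts in one grid, in the order above
    if max(counts) == 0:
        return "no_leader"            # grid contains no pick-up points at all
    return operators[int(np.argmax(counts))]

# e.g. 3 InPost points vs 1 Poczta point -> InPost leads the grid
leader = grid_leader([3, 1, 0, 0, 0, 0, 0])
```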
# function from stackoverflow for making own cmap
def make_colormap(seq):
    """Return a LinearSegmentedColormap
    seq: a sequence of floats and RGB-tuples. The floats should be increasing
    and in the interval (0,1).
    """
    seq = [(None,) * 3, 0.0] + list(seq) + [1.0, (None,) * 3]
    cdict = {'red': [], 'green': [], 'blue': []}
    for i, item in enumerate(seq):
        if isinstance(item, float):
            r1, g1, b1 = seq[i - 1]
            r2, g2, b2 = seq[i + 1]
            cdict['red'].append([item, r1, r2])
            cdict['green'].append([item, g1, g2])
            cdict['blue'].append([item, b1, b2])
    return mcolors.LinearSegmentedColormap('CustomMap', cdict)
c = mcolors.ColorConverter().to_rgb
# our own colour palettes
operators_cmap = make_colormap([c('red'), 0.2, c('magenta'), 0.4, c('grey'), 0.6, c('green'), 0.8, c('blue')])
blues = make_colormap([c('grey'), c('skyblue'), 0.1, c('skyblue'), c('aqua'), 0.2, c('aqua'), c('blue')])
greens = make_colormap([c('grey'), c('lightgreen'), 0.1, c('lightgreen'), c('lime'), 0.4, c('lime'), c('green')])
greys_cmap = make_colormap([c('grey'), c('grey'), 0.33, c('grey'), c('grey'), 0.66, c('grey')])
reds_cmap = make_colormap([c('lightcoral'), c('salmon'), 0.2, c('salmon'), c('indianred'), 0.4, c('indianred'), c('firebrick')])
blues_cmap = make_colormap([c('lightblue'), c('aqua'), 0.2, c('aqua'), c('deepskyblue'), 0.4, c('deepskyblue'), c('blue')])
greens_cmap = make_colormap([c('lightgreen'), c('lime'), 0.2, c('lime'), c('limegreen'), 0.4, c('limegreen'), c('green')])
pinks_cmap = make_colormap([c('violet'), c('magenta'), 0.2, c('magenta'), c('darkviolet'), 0.4, c('darkviolet'), c('purple')])
binary_cmap = make_colormap([c('red'), c('red'), 0.5, c('green'), c('green')])
def visualisation_basic(df):
    fig, ax = plt.subplots(1, 2, figsize=(30, 10))
    print('\nBasic plot of city')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), alpha=0.4, edgecolor='black', linewidth=3, legend=True, ax=ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Basic plot of Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), alpha=0.4, edgecolor='black', linewidth=3, legend=True, ax=ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Basic plot of Kraków')
    plt.plot()
    plt.show()
def visualisation_leader(df):
    fig, ax = plt.subplots(1, 2, figsize=(30, 10))
    print('\nLeader of operators')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="leader", categorical=True, legend=True, cmap=operators_cmap, edgecolor='black', alpha=0.7, ax=ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Leader of operators in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="leader", categorical=True, legend=True, cmap=operators_cmap, edgecolor='black', alpha=0.7, ax=ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Leader of operators in Kraków')
    plt.plot()
    plt.show()
def visualisation_dominance(df):
    fig, ax = plt.subplots(1, 2, figsize=(30, 10))
    print('\nDominance of inpost points over all other operators - > 50%')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="dominance_inpost2", categorical=True, legend=True, cmap=binary_cmap, edgecolor='white', alpha=0.7, ax=ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Dominance of Inpost points in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="dominance_inpost2", categorical=True, legend=True, cmap=binary_cmap, edgecolor='white', alpha=0.7, ax=ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Dominance of Inpost points in Kraków')
    plt.plot()
    plt.show()
def visualisation_total(df):
    fig, ax = plt.subplots(1, 2, figsize=(30, 10))
    print('\nTotal')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="tot", legend=True, cmap='Reds', edgecolor='white', alpha=0.7, ax=ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Total population in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="tot", legend=True, cmap='Reds', edgecolor='white', alpha=0.7, ax=ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Total population in Kraków')
    plt.plot()
    plt.show()
def visualisation_forests(df):
    fig, ax = plt.subplots(1, 2, figsize=(30, 10))
    print('\nForests')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="forests", legend=True, cmap=greens, edgecolor='white', alpha=0.7, ax=ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Number of forests in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="forests", legend=True, cmap=greens, edgecolor='white', alpha=0.7, ax=ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Number of forests in Kraków')
    plt.plot()
    plt.show()
def visualisation_schools(df):
    fig, ax = plt.subplots(1, 2, figsize=(30, 10))
    print('\nSchools')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="schools", legend=True, cmap=blues, edgecolor='white', alpha=0.7, ax=ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Number of schools in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="schools", legend=True, cmap=blues, edgecolor='white', alpha=0.7, ax=ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Number of schools in Kraków')
    plt.plot()
    plt.show()
The figure below shows both cities against the background of OSM maps.
visualisation_basic(df_all)
The next figure shows the leading operator in each grid. We can see that Poczta Polska dominates the city center (blue color on the map). In turn, Inpost dominates around the center (pink color), i.e. around the Poczta Polska points. DHL (red color) and other smaller operators (green color) are the leaders only in a few grids. On the outskirts of the cities there is, as a rule, no leader (gray color).
visualisation_leader(df_all)
The figure below shows where Inpost dominates over the other operators. Green color means that Inpost points are more than half as numerous as the points of all other operators combined. Referring to the previous figure, Inpost's dominance occurs mainly in those grids where Inpost is the leader on its own.
visualisation_dominance(df_all)
Then we show how the demographic variable is distributed in cities. As expected, total population is the highest in city centers.
visualisation_total(df_all)
Taking into account the forest points, their distribution on the map is very regular. It is noticeable that there are also many green areas on the outskirts of the cities.
visualisation_forests(df_all)
Finally, we show the location of the schools (elementary schools, kindergartens, etc.). As expected, the highest number of schools is in the center, where also the total population is highest.
visualisation_schools(df_all)
This chapter is devoted to spatial exploratory data analysis (SEDA). We utilize many spatial statistics tools here to better understand the spatial dependencies in our datasets. We distinguish the following spatial SEDA stages:
Importantly, in the study we considered which spatial weight matrix would be appropriate. We chose between ROOK and QUEEN contiguity. Based on our attempts at multivariate spatial econometric modeling, we concluded that QUEEN is the appropriate matrix (due to the significance of the spatial terms). This choice is also intuitive for the problem we are addressing.
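As a minimal illustration of the difference (a pure-Python sketch, not the libpysal implementation used below): on a regular grid, ROOK links cells that share an edge, while QUEEN additionally links cells that share only a corner.

```python
# Toy contiguity builder for an n x n grid of cells indexed row-major.
def grid_neighbors(n, queen=True):
    w = {}
    for r in range(n):
        for c in range(n):
            nbrs = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == dc == 0:
                        continue
                    if not queen and abs(dr) + abs(dc) != 1:
                        continue  # ROOK: keep edge-sharing neighbors only
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n:
                        nbrs.append(rr * n + cc)
            w[r * n + c] = sorted(nbrs)
    return w

rook = grid_neighbors(3, queen=False)
queen = grid_neighbors(3, queen=True)
# The center cell of a 3x3 grid (index 4) has 4 ROOK neighbors but 8 QUEEN neighbors.
print(len(rook[4]), len(queen[4]))  # 4 8
```

On our square 1 km² grid the corner contacts added by QUEEN are exactly the diagonal neighbors, which is why the two criteria can give noticeably different spatial terms.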
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import geometry
from shapely import wkt
from libpysal.weights import contiguity
from libpysal.weights import lag_spatial
from pyproj import CRS
import mapclassify as mc
import splot
import esda
import seaborn as sns
from esda.moran import Moran
from splot.esda import moran_scatterplot
from splot.esda import plot_moran
import matplotlib.pyplot as plt
import gc
from esda.moran import Moran_Local
from splot.esda import lisa_cluster
from splot.esda import plot_local_autocorrelation
from splot.libpysal import plot_spatial_weights
from esda.moran import Moran_BV_matrix
from splot.esda import moran_facet
from shapely.ops import cascaded_union
from geovoronoi.plotting import subplot_for_map, plot_voronoi_polys_with_points_in_area
from geovoronoi import voronoi_regions_from_coords, points_to_coords
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%config Completer.use_jedi = False
In the first step let us import the datasets necessary to run the analysis: the final datasets for Warsaw and Cracow and the GUGiK powiat boundaries.
def load_shp_from_csv(name: str):
    df = pd.read_csv(f'../datasets/preprocessed_data/{name}.csv')
    df['geometry'] = df['geometry'].apply(wkt.loads)
    df = gpd.GeoDataFrame(df, crs='epsg:4258')
    return df
df_warszawa = load_shp_from_csv("df_warszawa")
df_krakow = load_shp_from_csv("df_krakow")
pov = gpd.read_file("../datasets/raw_data/guigk_pov/Powiaty.shx", encoding='utf-8')
pov = pov.loc[pov.JPT_NAZWA_.isin(["powiat Warszawa", "powiat Kraków"])==True, ("JPT_NAZWA_", "geometry")]
pov = pov.to_crs("epsg:4258")
print(df_warszawa.shape, df_krakow.shape)
Here we define one utility function which allows us to run spatial SEDA for each city and stage. For clean code it should be split into separate functions, but due to time constraints we kept a single function.
def spatial_seda(city, city_name, analysis_part):
    df = city.copy()
    df_facet = df[['poczta', 'dhl', 'inpost', 'dpd', 'ruch']]
    # spatial weight matrix
    y = df.inpost.values
    w = contiguity.Queen.from_dataframe(df)
    w.transform = 'r'
    if analysis_part == 1:
        plot_spatial_weights(w, df["geometry"].to_crs("epsg:3395"))
        plt.title("Spatial weights visualization")
        plt.show()
        # general plot, quantiles plot, spatial lag plot
        fig, axs = plt.subplots(1, 3, figsize=(30, 10))
        df.plot(column="inpost", legend=True, figsize=(10, 10), ax=axs[0])
        axs[0].set_title("InPost")
        axs[0].legend(prop={'size': 2})
        df.plot(column='inpost', scheme='Quantiles', k=5, legend=True, figsize=(10, 10), ax=axs[1])
        axs[1].set_title("InPost (Quintiles)")
        ylag = lag_spatial(w, y)
        ylagq5 = mc.Quantiles(ylag, k=4)
        df.assign(cl=ylagq5.yb).plot(column='cl', categorical=True, k=4, linewidth=0.1, ax=axs[2], legend=True, cmap="viridis")
        axs[2].set_title("Spatial Lag InPost (Quintiles)")
        plt.show()
    elif analysis_part == 2:
        # global Moran's I
        mi = esda.moran.Moran(y, w)
        print("Moran's I", mi.I, "\np-value of I under normality assumption", mi.p_norm, "\np-value based on permutations", mi.p_sim)
        plot_moran(mi, zstandard=True, figsize=(10, 4))
        plt.show()
        # global Geary's C
        geary = esda.geary.Geary(y, w)
        print("Geary's: ", geary.C, "p-value:", geary.p_sim)
        # global join count (binary variable: above the 85th percentile)
        y_prim = (y > np.quantile(y, 0.85)) * 1
        df["inpost_bin"] = y_prim
        df.plot(column="inpost_bin", legend=True, figsize=(10, 10))
        plt.title("InPost in Joint Count analysis")
        plt.show()
        w.transform = 'O'  # restore original (binary) weights for join counts
        jc = esda.join_counts.Join_Counts(df.inpost_bin, w)
        print("Joint count bb pvalue", jc.p_sim_bb, "; JC bw pvalue", jc.p_sim_bw)
        w.transform = 'R'
    elif analysis_part == 3:
        # local Moran (LISA)
        moran_loc = Moran_Local(y, w)
        fig, ax = moran_scatterplot(moran_loc, p=0.1)
        ax.set_xlabel('InPost')
        ax.set_ylabel('Spatial Lag of InPost')
        plt.show()
        lisa_cluster(moran_loc, df.to_crs("epsg:3395"), p=0.1, figsize=(10, 10), legend=True)
        plt.show()
        # bivariate Moran analysis
        matrix2 = Moran_BV_matrix(df_facet, w)
        moran_facet(matrix2)
        plt.show()
    elif analysis_part == 4:
        # tessellation
        df_point = load_shp_from_csv("pickup_points_by_operator")
        df_point = df_point[df_point.operator == "inpost"]
        curr_pov = pov[pov.JPT_NAZWA_ == city_name]
        df_point = gpd.sjoin(df_point, curr_pov, op='within', how='inner').to_crs(epsg=3395)
        curr_pov = curr_pov.to_crs(epsg=3395)
        boundary_shape = cascaded_union(curr_pov.geometry)
        coords = points_to_coords(df_point.geometry)
        region_polys, region_pts = voronoi_regions_from_coords(coords, boundary_shape)
        fig, ax = subplot_for_map(figsize=(20, 10))
        plot_voronoi_polys_with_points_in_area(ax, boundary_shape, region_polys, coords, region_pts)
        plt.show()
        fig, ax = subplot_for_map(figsize=(20, 10))
        plot_voronoi_polys_with_points_in_area(ax, boundary_shape, region_polys, coords, points_markersize=0.1, voronoi_color='grey', voronoi_edgecolor='black')
        plt.show()
        df_point = load_shp_from_csv("pickup_points_by_operator")
        stats_per_operator = list()
        for operator in df_point.operator.unique():
            df_point_copy = df_point.copy()
            df_point_copy = df_point_copy[df_point_copy.operator == operator]
            curr_pov = pov[pov.JPT_NAZWA_ == city_name]
            df_point_copy = gpd.sjoin(df_point_copy, curr_pov, op='within', how='inner').to_crs(epsg=3395)
            curr_pov = curr_pov.to_crs(epsg=3395)
            boundary_shape = cascaded_union(curr_pov.geometry)
            coords = points_to_coords(df_point_copy.geometry)
            region_polys, region_pts = voronoi_regions_from_coords(coords, boundary_shape)
            a = np.array([region_polys[i].area for i in range(len(region_polys))])
            a1 = a / sum(a)
            ent1 = sum(-1 * a1 * np.log(a1))
            n = len(a)
            ent_ref = np.log(1 / n) * (-1)  # = log(n), the maximal entropy for n regions
            ent_rel = ent1 / ent_ref
            stats_per_operator.append([ent1, ent_rel])
        x = pd.DataFrame(stats_per_operator, index=df_point.operator.unique(), columns=["Shannon entropy", "Relative H entropy"])
        display(x)
        x["Relative H entropy"].plot(figsize=(10, 5), title="Relative H entropy per operator")
        plt.show()
We start the analysis by examining the neighbors formed by the spatial weight matrix. We can see that QUEEN contiguity was applied correctly.
Then we move to visualizations of InPost pick-up points. The first visualization shows the spatial distribution of InPost pick-up points treated as a continuous variable. The next plot divides this variable into quintiles, while the last visualization shows the spatial lag divided into quintiles. The division into quintiles lets us better observe the density of InPost pick-up points, while the spatial lag, in addition to its natural econometric interpretation (a variable that averages the neighboring values of a location), smooths the quantile map and enhances the impression of value similarity in space. Generally, we can clearly see that most pick-up points are located in the city center, in the center of Praga and in the western part. In the south and east there are still places with 0 pick-up points.
spatial_seda(df_warszawa, "powiat Warszawa", 1)
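The spatial lag in the third panel can be reproduced by hand: with row-standardized weights, each lagged value is simply the mean of the neighbors' values, which is what `lag_spatial(w, y)` returns when `w.transform = 'r'`. A small numpy sketch on a toy 4-cell chain (synthetic data, not the study's grid):

```python
import numpy as np

# Binary contiguity of 4 cells in a line: 0-1-2-3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W_row = W / W.sum(axis=1, keepdims=True)  # row-standardization ('r' transform)
y = np.array([2.0, 4.0, 6.0, 8.0])
ylag = W_row @ y  # each entry = mean of the neighbors' values
print(ylag)  # [4. 4. 6. 6.]
```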
Then we move to the global spatial autocorrelation analysis.
First we checked Moran's I statistic. Based on its value and p-value we can claim that there is positive spatial autocorrelation in the InPost data. We also used the Moran scatterplot to visualize the relation between areas.
Then we checked whether Geary's statistic also indicates positive spatial autocorrelation, and based on its output we also claim that there is spatial autocorrelation in the data.
Last but not least, we applied the Joint Count test for the binary case (85th percentile as the cut-off point for the dependent variable); it also indicated that the numbers of InPost pick-up points in the grid cells are not spatially random.
spatial_seda(df_warszawa, "powiat Warszawa", 2)
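For intuition, global Moran's I can be verified by hand on a toy example, following the textbook definition that esda implements (synthetic data, not the study's grid):

```python
import numpy as np

# I = (n / S0) * (z' W z) / (z' z), where z are deviations from the mean
# and S0 is the sum of all weights.
def morans_i(y, W):
    n = len(y)
    z = y - y.mean()
    S0 = W.sum()
    return (n / S0) * (z @ W @ z) / (z @ z)

W = np.array([[0, 1, 0, 0],   # binary contiguity of 4 cells in a line
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([0.0, 0.0, 1.0, 1.0])  # lows cluster next to lows, highs next to highs
print(morans_i(y, W))  # positive -> positive spatial autocorrelation
```

A perfectly alternating pattern such as `[0, 1, 0, 1]` yields a negative value, which is the mirror-image diagnostic.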
In the next step we used the local Moran measure (LISA) to obtain the local Moran scatterplot. The Python pysal package allows us to visualize Hot Spots (HH), Cold Spots (LL) and Spatial Outliers (HL, LH). Again we see that the city center is a Hot Spot. It is evident that bedroom suburbs are classified as Cold Spots. HL spots (so-called diamonds) occur in or near bedroom suburbs, while LH spots occur near the city center. These conclusions are quite intuitive.
Then we plotted bivariate Moran scatterplots for our logistics companies. They illustrate that relationships between these variables do indeed exist.
spatial_seda(df_warszawa, "powiat Warszawa", 3)
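The HH/LL/HL/LH labels behind the LISA cluster map above come from the signs of the demeaned value and its spatial lag; a simplified sketch that ignores the permutation-based significance filtering applied by `Moran_Local` (toy data, not the study's grid):

```python
import numpy as np

def lisa_quadrants(y, W_row):
    z = y - y.mean()
    zlag = W_row @ z  # spatial lag of the demeaned values
    labels = []
    for zi, li in zip(z, zlag):
        if zi >= 0 and li >= 0:
            labels.append("HH")  # hot spot
        elif zi < 0 and li < 0:
            labels.append("LL")  # cold spot
        elif zi >= 0:
            labels.append("HL")  # high value among low neighbors (diamond)
        else:
            labels.append("LH")  # low value among high neighbors
    return labels

W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W_row = W / W.sum(axis=1, keepdims=True)
y = np.array([0.0, 0.0, 1.0, 1.0])
print(lisa_quadrants(y, W_row))
```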
Importantly, here we switched from the grid data to point data!
The final piece of SEDA is an attempt to use a tessellation mechanism to visualize and examine the agglomeration of each logistics company. The visualizations presented below clearly show the concentration of InPost pick-up points and lead to the same conclusions as before.
Then we used Shannon entropy and relative H entropy to analyse the agglomeration of each logistics company. We can see that DHL and Poczta Polska are the most agglomerated companies. InPost is in the middle of the ranking, which seems to be the right place, because too high an agglomeration has a negative impact on accessibility in all parts of Warsaw.
spatial_seda(df_warszawa, "powiat Warszawa", 4)
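The entropy computation from stage 4 can be isolated as a small sketch: Shannon entropy of the Voronoi cell area shares, and relative entropy H / log(n), where 1 means perfectly even coverage and values near 0 mean strong agglomeration (toy area vectors here, not the actual Voronoi areas):

```python
import numpy as np

def area_entropy(areas):
    p = np.asarray(areas, dtype=float)
    p = p / p.sum()                 # area shares
    H = -np.sum(p * np.log(p))      # Shannon entropy
    return H, H / np.log(len(p))    # relative H entropy in [0, 1]

H, H_rel = area_entropy([1.0, 1.0, 1.0, 1.0])   # perfectly even areas
print(round(H, 4), round(H_rel, 4))             # H = log(4), relative = 1.0
H2, H_rel2 = area_entropy([97.0, 1.0, 1.0, 1.0])  # one dominant cell
print(round(H_rel2, 4))                         # well below 1
```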
In the Cracow case, the weight matrix based on QUEEN contiguity also works better than ROOK (in the econometric part). Based on the first plot, we see that it was calculated correctly.
Then we analyze the visualizations of InPost in the continuous, quintile and spatial-lag-quintile manner. We can clearly see that pick-up points are located strictly in the city center. The lowest level of coverage is in the eastern and western parts of the city (especially the eastern).
spatial_seda(df_krakow, "powiat Kraków", 1)
Global spatial autocorrelation measures such as Moran's I and Geary's C indicate statistically significant positive spatial autocorrelation in the InPost data. The Joint Count test also shows spatial autocorrelation for InPost pick-up points.
spatial_seda(df_krakow, "powiat Kraków", 2)
LISA allowed us to visualize the HH, LL, HL and LH spots in Cracow. Based on it, we can claim that all HH spots are located in the city center. There are only 2 HL spots (diamonds). The western and eastern parts of Cracow are definitely LL.
Then we plotted bivariate Moran scatterplots for our logistics companies. They illustrate that relationships between these variables do indeed exist.
spatial_seda(df_krakow, "powiat Kraków", 3)
Importantly, here we switched from the grid data to point data!
The point visualization using the tessellation technique presented below clearly shows the concentration of InPost pick-up points in the city center, thus leading to the same conclusions as before.
Then we used Shannon entropy and relative H entropy to analyse the agglomeration of each logistics company. We can see that Ruch, Poczta Polska and DHL are the most agglomerated companies. InPost is again in the middle of the ranking, which seems to be the right place, because too high an agglomeration has a negative impact on accessibility in all parts of Cracow.
spatial_seda(df_krakow, "powiat Kraków", 4)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
%matplotlib inline
# libraries for geo transformation to save file to shp format
import shapely.wkt
import geopandas as gpd
In this part of the study, we conduct non-spatial exploratory data analysis. First, we compute descriptive statistics. Then we check which variables should be removed from the dataset. For the remaining variables, we inspect their distributions with histograms and box plots. Next, we calculate the correlations between the variables by category, i.e. operators, demographic and points-of-interest data. Finally, we transform variables where necessary (logarithmization and binarization).
df_warszawa = pd.read_csv("../datasets/preprocessed_data/df_warszawa.csv")
df_krakow = pd.read_csv("../datasets/preprocessed_data/df_krakow.csv")
df_warszawa['city'] = 'Warszawa'
df_krakow['city'] = 'Krakow'
df = pd.concat([df_warszawa, df_krakow])
df = df.reset_index()
df = df.drop(columns = {'index'})
features = df.columns.to_list()
features.remove('grid_index')
features.remove('city')
features.remove('geometry')
Considering the descriptive stats table, we can see that we have a division into count data (operators and points of interest data) and a continuous variable (demographic data). In the next steps, we will examine the distributions of these variables and make some transformations.
df[features].describe()
The cell below shows what percentage of zero values each variable has in our dataset. If the ratio of zeroes for a variable is higher than 90%, that variable is removed from the dataset.
to_remove = []
for i in list(features):  # iterate over a copy so removals below don't skip items
    print("\n ############ \nFeature:", i)
    zero_pct = round(df[i].value_counts()[0] / df.shape[0] * 100, 2)
    print(zero_pct, f"% of grids with 0 {i}.")
    print(df[i].nunique(), "unique values.")
    if zero_pct > 90:
        to_remove.append(i)
        features.remove(i)
# Variables to_remove
to_remove
# dropping columns
df = df.drop(columns = to_remove)
Using histograms and box plots, we examine the distributions of all variables. As expected, most features have similar distributions: many low values and a very long tail of high values.
plt.rcParams["figure.figsize"] = (20, 5)
for i in features:
    print(f"##############\n{i}\n##############")
    fig, ax = plt.subplots(1, 2)
    sns.histplot(x=df[i], kde=False, edgecolor="black", color='indianred', linewidth=1.5, ax=ax[0])
    sns.boxplot(x=df[i], color='lightgreen', ax=ax[1])
    plt.show()
Next, we examine the correlation of the variables. Due to the relatively large dataset, we will divide the correlation study into three smaller sets according to the characteristics of the variables.
df[features].corr()
plt.rcParams["figure.figsize"] = (20, 16)
sns.heatmap(df[features].corr(), annot=True, cmap = 'Reds')
plt.show()
Division into three smaller sets
operators = ['dhl','dpd','inpost','poczta','ruch','ups']
demo = ['tot','tot_0_14','tot_15_64','tot_65__','tot_male','tot_fem','male_0_14','male_15_64','male_65__','fem_0_14','fem_15_64','fem_65__','fem_ratio']
points = ['buildings','shops','parks','forests','schools','railways','cycleways','parkings','crossings']
The correlation between DHL and Poczta Polska is disturbing: as much as 93%! In the next stages of the study, we will decide which variable should be removed during modeling. The relatively high correlations between Poczta Polska and Ruch and between InPost and DHL should also be noted.
plt.rcParams["figure.figsize"] = (15, 8)
sns.heatmap(df[operators].corr(), annot=True, cmap = 'Reds')
plt.show()
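High pairwise correlation such as the DHL/Poczta Polska pair usually surfaces later as a high variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. An illustrative computation on synthetic data (the R part of this study uses `car::vif` instead; this is just the mechanics):

```python
import numpy as np

def vif(X):
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        yj = X[:, j]
        Xo = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + others
        beta, *_ = np.linalg.lstsq(Xo, yj, rcond=None)
        resid = yj - Xo @ beta
        r2 = 1 - (resid @ resid) / ((yj - yj.mean()) @ (yj - yj.mean()))
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)  # nearly a copy of `a`
v = vif(np.column_stack([a, b, c]))
print([round(x, 1) for x in v])  # VIFs for a and c are huge, b stays near 1
```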
The correlation between the demographic variables is extremely high. Only the "female ratio" variable is not correlated with the others. Most of these variables should be removed in the subsequent phases of the study.
plt.rcParams["figure.figsize"] = (15, 8)
sns.heatmap(df[demo].corr(), annot=True, cmap = 'Reds')
plt.show()
The last correlation plot covers the points-of-interest variables. The highest correlation among them is between "parkings" and "schools", ~74%.
plt.rcParams["figure.figsize"] = (15, 8)
sns.heatmap(df[points].corr(), annot=True, cmap = 'Reds')
plt.show()
The function below checks whether a variable should be binarized (according to the rule that the most common value accounts for over 33% of all observations). If a variable is binarized, the new variable is visualized and appended to the dataset.
def visualisation_binary(df, var):
    new_name = var + '_binary'
    # binarize when the most common value covers more than 33% of observations
    if df[var].value_counts().iloc[0] > int(df.shape[0] * 0.33):
        print('\nnew variable: "' + new_name + '"')
        # for these count features the most common value is 0
        df[new_name] = df[var].apply(lambda x: 'few_' + var if x == 0 else 'more_' + var)
        plt.figure(figsize=(10, 5))
        df[new_name].value_counts().plot(kind='bar')
        plt.show()
    else:
        print(f'\nno binary transformation for variable "{var}"')
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [10, 5]})
for point in points:
    visualisation_binary(df, point)
Then we log-transform the demographic features (they are continuous and their distributions have a long tail). The left-hand side shows the distributions before the transformation, the right-hand side after.
def visualisation_log(df, var):
    new_name = var + 'log'
    df[new_name] = np.log(1 + df[var])
    fig, ax = plt.subplots(1, 2)
    sns.histplot(x=df[var], kde=False, edgecolor="black", color='indianred', linewidth=1.5, ax=ax[0]).set_title(f"{var}", fontsize=20)
    sns.histplot(x=df[new_name], kde=False, edgecolor="black", color='lightgreen', linewidth=1.5, ax=ax[1]).set_title(f"Logarithm of {var}", fontsize=20)
    plt.show()
demo_logarithms = ['tot','tot_0_14','tot_15_64','tot_65__','tot_male','tot_fem','male_0_14','male_15_64','male_65__','fem_0_14','fem_15_64','fem_65__']
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [20, 5]})
for point in demo_logarithms:
    visualisation_log(df, point)
df.to_csv("../datasets/preprocessed_data/df_after_transform.csv", index=False)
The above transformations are also applied to the GWR datasets.
df_warszawa = pd.read_csv("../datasets/preprocessed_data/df_warszawa_gwr.csv")
df_krakow = pd.read_csv("../datasets/preprocessed_data/df_krakow_gwr.csv")
df_warszawa['city'] = 'Warszawa'
df_krakow['city'] = 'Krakow'
df_gwr = pd.concat([df_warszawa, df_krakow])
df_gwr = df_gwr.reset_index()
df_gwr = df_gwr.drop(columns = {'index'})
demo_logarithms_gwr = ['TOT','TOT_0_14','TOT_15_64','TOT_65__','TOT_MALE','TOT_FEM','MALE_0_14','MALE_15_64','MALE_65__','FEM_0_14','FEM_15_64','FEM_65__']
for point in points:
    visualisation_binary(df_gwr, point)
for point in demo_logarithms_gwr:
    visualisation_log(df_gwr, point)
df_gwr['geometry'] = df_gwr.apply(lambda x: shapely.wkt.loads(x['center']), axis=1)
df_gwr = gpd.GeoDataFrame(df_gwr, geometry = 'geometry', crs = "epsg:4258")
df_gwr.to_file("../datasets/preprocessed_data/df_gwr_after_transform.shp")
df_gwr[df_gwr.city == 'Warszawa'].to_file("../datasets/preprocessed_data/df_warszawa_for_gwr.shp")
df_gwr[df_gwr.city == 'Krakow'].to_file("../datasets/preprocessed_data/df_krakow_for_gwr.shp")
Warning! This chapter is implemented in R.
In this chapter we build spatial econometric models to address our research hypotheses. Based on the spatial autocorrelation tests performed in the previous chapters, we believe that spatial autocorrelation exists in the analyzed phenomenon. We create models for Warsaw and Cracow separately. We use the 1 km² grid.
Generally, we specify the following functional form of the models: $inpost_i = \alpha \cdot \text{logistic\_companies}_i + \beta \cdot \text{demographic\_variables}_i + \gamma \cdot \text{points\_of\_interest\_variables}_i + \theta \cdot \text{spatial\_terms}_i + \epsilon_i$,
where:
Obviously, a significant proportion of these variables will be removed from the final model due to high collinearity or statistical insignificance; however, based on the literature and our intuition, we assume that at least 1-2 variables from each category will ultimately be significant.
Our expectations for the results (given the research hypotheses) are as follows. Based on the visual analysis of spatial data carried out in the previous chapters, it seems to us that InPost has deployed parcel pick-up points in line with the competition. Therefore, we expect the variables in the logistic_companies category to be significant and to have a positive sign. In addition, the control variables (groups: demographic_variables, point_of_interest_variables) should also matter in the models, with signs following the logic: the more shops in the grid cell, the more parcel machines; the greater the afforestation, the fewer parcel machines; the more bus stops, the more parcel machines; the higher the population density, the more parcel machines; and so on. We also believe that extending the OLS model with spatial components is a good idea and that models accounting for spillover effects will be econometrically better. We apply these expectations to both Warsaw and Cracow.
We adopted the following modeling strategy due to the complexity of the problem:
We covered all possible options to improve the models:
We set the significance level at 10%.
library(rgdal)
library(spdep)
library(tidyverse)
library(car)
library(lmtest)
library(arules)
library(spatialreg)
library(texreg)
library(stargazer)
df <- readOGR("../datasets/preprocessed_data/df_warszawa.shp")
df_data <- read.csv("../datasets/preprocessed_data/df_warszawa.csv")
df <- spTransform(df, CRS("+proj=longlat +ellps=GRS80 +no_defs"))
df.cont.nb <- poly2nb(as(df, "SpatialPolygons"), queen=TRUE) #QUEEN but ROOK was tried!
df.cont.listw <- nb2listw(df.cont.nb, style="W")
cat(colnames(df_data))
We investigate again the correlation between the logistics companies. We can clearly see that Poczta Polska is highly collinear with the other variables, so we remove it from further analysis. The remaining variables seem fine.
stats::cor(df_data %>% select('dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch', 'ups'))
Potential enhancement of the features: we convert the age-group counts into shares of the respective population totals.
df$tot_0_14 <- df$tot_0_14/df$tot
df$tot_15_64 <- df$tot_15_64/df$tot
df$tot_65__ <- df$tot_65__/df$tot
df$male_0_14 <- df$male_0_14/df$tot_male
df$male_15_64 <- df$male_15_64/df$tot_male
df$male_65__ <- df$male_65__/df$tot_male
df$fem_0_14 <- df$fem_0_14/df$tot_fem
df$fem_15_64 <- df$fem_15_64/df$tot_fem
df$fem_65__ <- df$fem_65__/df$tot_fem
Based on many trials, we decided that the formula below is the most appropriate general formula for the given problem. We removed Poczta Polska from the logistics companies, kept total density and the feminization ratio among the demographic data, and kept all the points-of-interest data.
formula = inpost ~ dhl + dpd + fedex + ruch + ups +
log(tot + 1) + fem_ratio +
buildings + shops + parks + forests + schools +
railways + cycleways + parkings + crossings + bus_stops
We estimate the OLS model and perform its diagnostics. We see that nearly 50% of the variables are significant (2 or more per data category), and there is no high collinearity. However, the model failed the Ramsey RESET and Breusch-Pagan tests, so the functional form is wrong and the errors are heteroscedastic. Moreover, at the 10% significance level the residuals are spatially dependent.
ols <- lm(formula, data=df)
summary(ols)
car::vif(ols)
bptest(ols)
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw)
Then we applied a stepwise regression procedure to obtain the most significant model, starting from the functional form defined above.
intercept_only <- lm(inpost~1, data=df)
all <- lm(formula, data=df)
both <- step(intercept_only, direction='both', scope=formula(all), trace=0)
both
Based on the stepwise regression we obtained the final formula of the model and we run the final OLS with its diagnostics. Nearly all variables in the model are significant (although, due to heteroscedasticity, we use a biased error variance estimator; we omitted the robust covariance matrix due to time constraints). There is no collinearity in the model, but again we failed the RESET and B-P tests. Moreover, the residuals of the model are still spatially dependent. As for the interpretation: the logistics companies have a positive sign, as expected; the greater the total density, the more InPost pick-up points; the fem_ratio result is interesting because its sign is negative (more women, fewer pick-up points; quite spurious?); and the negative relationship between parks and pick-up points is interesting but perhaps intuitive.
formula = inpost ~ dhl + schools + log(tot + 1) + ruch + parks +
ups + crossings + fem_ratio + fedex
ols <- lm(formula, data=df)
summary(ols)
cat("AIC", AIC(ols))
car::vif(ols)
bptest(ols)
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw)
Now we move to the estimation of the spatial models. In the first step we estimate all model variants with respect to spatial terms, and then we analyze which model is reasonable (based on the number of significant variables, significant spatial terms, and AIC+BIC).
GNS <- sacsarlm(formula, data=df, listw=df.cont.listw, type="sacmixed", method="LU")
summary(GNS)
SAC <- sacsarlm(formula, data=df, listw=df.cont.listw)
summary(SAC)
SDM <- lagsarlm(formula, data=df, listw=df.cont.listw, type="mixed") # Spatial Durbin: lag of y plus lagged X
summary(SDM)
SDEM <- errorsarlm(formula, data=df, listw=df.cont.listw, etype="emixed")
summary(SDEM)
SEM <- errorsarlm(formula, data=df, listw=df.cont.listw)
summary(SEM)
SAR <- lagsarlm(formula, data=df, listw=df.cont.listw)
summary(SAR)
SLX <- lmSLX(formula, data=df, listw=df.cont.listw)
summary(SLX)
Based on the above outputs, we can clearly state that the best spatial model is SAR (the spatial lag model): significant spatial terms and many significant variables.
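For reference, the SAR specification selected here can be written as:

$$y = \rho W y + X\beta + \varepsilon, \qquad y = (I_n - \rho W)^{-1}(X\beta + \varepsilon),$$

where $W$ is the row-standardized QUEEN weight matrix and $\rho$ is the spatial autoregressive parameter. The reduced form shows why the coefficients $\beta$ cannot be read directly as marginal effects: a change in a regressor in one grid cell propagates to its neighbors through $(I_n - \rho W)^{-1}$, which is why an impact analysis is needed.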
Then we compare the models' BIC and AIC. Relative to OLS, the best model is clearly SAR, although its BIC is greater than the OLS BIC. In terms of AIC, which we consider more important, the best model is SLX; however, based on its output we claim that SAR is still better for econometric inference.
cat("ols",BIC(ols) ,'\n')
cat("GNS",BIC(GNS),'\n')
cat("SAC",BIC(SAC),'\n')
cat("SDM",BIC(SDM),'\n')
cat("SDEM",BIC(SDEM),'\n')
cat("SAR",BIC(SAR),'\n')
cat("SLX",BIC(SLX),'\n')
cat("SEM",BIC(SEM),'\n')
cat('\n')
cat("ols",AIC(ols) ,'\n')
cat("GNS",AIC(GNS),'\n')
cat("SAC",AIC(SAC),'\n')
cat("SDM",AIC(SDM),'\n')
cat("SDEM",AIC(SDEM),'\n')
cat("SAR",AIC(SAR),'\n')
cat("SLX",AIC(SLX),'\n')
cat("SEM",AIC(SEM),'\n')
Then we tried to find a better functional form for the SAR model using a general-to-specific procedure done by hand; however, we were not able to improve the model (our previous functional form was the best one). The residuals of the model are not autocorrelated.
formula_max = inpost ~ dhl + dpd + fedex + ruch + ups +
log(tot + 1) + fem_ratio +
buildings + shops + parks + forests + schools +
railways + cycleways + parkings + crossings + bus_stops
formula_gen_to_specifc = inpost ~ dhl + fedex + ruch + ups +
log(tot + 1) + fem_ratio +
parks + schools +
crossings
SAR_g2s<- lagsarlm(formula_gen_to_specifc, data=df, listw=df.cont.listw)
summary(SAR_g2s) #the same as previously!
Now we calculate the direct and indirect impacts of the model and analyze the output.
Let us start with the logistics company variables. Both the direct and the indirect impacts of all logistics companies (apart from the indirect FedEx effect) are significant (at least at around 10%). The direct (internalization) effects are greater than the spillovers, but the spillover effects are still significant. Among the demographic variables, total population density is important for the model in terms of both direct and indirect impact (with a positive sign). The feminization ratio is also significant in terms of direct impact and has a negative sign (which is quite unintuitive). Among the points-of-interest data, the numbers of schools, parks and crossings are significant in the model (the crossings spillover effect is not relevant), with positive, negative and positive signs respectively. We claim that this might be intuitive.
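The impact decomposition reported by impacts() follows the standard LeSage-Pace approach: for regressor $x_k$ with coefficient $\beta_k$,

$$\frac{\partial y}{\partial x_k^\top} = (I_n - \rho W)^{-1} I_n \beta_k = S_k(W),$$

where the direct effect is the average of the diagonal of $S_k(W)$, the total effect is the average row sum, and the indirect (spillover) effect is their difference. The traces of the powers of $W$ computed below (trW) approximate these averages without inverting the full $n \times n$ matrix.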
# distribution of total impact
# vector of traces of powers of a spatial weights matrix
# converting censored listw to CsparseMatrix form
W.c <- as(as_dgRMatrix_listw(df.cont.listw), "CsparseMatrix")
# the default values for the number of powers is 30
trMat<-trW(W.c, type="mult")
SAR_imp<-impacts(SAR, tr=trMat, R=200)
summary(SAR_imp, zstats=TRUE, short=TRUE)
For Cracow we applied the same procedure as for Warsaw, so we omit additional comments here and focus only on the results.
df <- readOGR("../datasets/preprocessed_data/df_krakow.shp")
df_data <- read.csv("../datasets/preprocessed_data/df_krakow.csv")
df <- spTransform(df, CRS("+proj=longlat +ellps=GRS80 +no_defs"))
df.cont.nb <- poly2nb(as(df, "SpatialPolygons"), queen=TRUE) #QUEEN but ROOK was tried!
df.cont.listw <- nb2listw(df.cont.nb, style="W")
cat(colnames(df_data))
stats::cor(df_data %>% select('dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch', 'ups'))
df$tot_0_14 <- df$tot_0_14/df$tot
df$tot_15_64 <- df$tot_15_64/df$tot
df$tot_65__ <- df$tot_65__/df$tot
df$male_0_14 <- df$male_0_14/df$tot_male
df$male_15_64 <- df$male_15_64/df$tot_male
df$male_65__ <- df$male_65__/df$tot_male
df$fem_0_14 <- df$fem_0_14/df$tot_fem
df$fem_15_64 <- df$fem_15_64/df$tot_fem
df$fem_65__ <- df$fem_65__/df$tot_fem
formula = inpost ~ dhl + dpd + fedex + ruch + ups +
log(tot + 1) + fem_ratio +
buildings + shops + parks + forests + schools +
railways + cycleways + parkings + crossings + bus_stops
ols <- lm(formula, data=df)
summary(ols)
car::vif(ols)
bptest(ols)
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw)
intercept_only <- lm(inpost~1, data=df)
all <- lm(formula, data=df)
both <- step(intercept_only, direction='both', scope=formula(all), trace=0)
Based on the stepwise regression we obtained the final formula of the model and we run the final OLS with its diagnostics. Nearly all variables in the model are significant (although, due to heteroscedasticity, we use a biased error variance estimator; we omitted the robust covariance matrix due to time constraints). There is no collinearity in the model, but again we failed the RESET and B-P tests. Here, however, the residuals of the model are not spatially dependent. As for the interpretation: the logistics companies have a positive sign, as expected; the greater the total density, the more InPost pick-up points; the fem_ratio sign is again negative (more women, fewer pick-up points); and the negative relationship between parks and pick-up points is interesting but intuitive. There are positive relationships between pick-up points and schools, parkings, cycleways and bus stops, which is also intuitive.
formula = formula(both)
ols <- lm(formula, data=df)
summary(ols)
cat("AIC", AIC(ols))
car::vif(ols)
bptest(ols)
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw)
GNS <- sacsarlm(formula, data=df, listw=df.cont.listw, type="sacmixed", method="LU")
summary(GNS)
SAC <- sacsarlm(formula, data=df, listw=df.cont.listw)
summary(SAC)
SDM <- lagsarlm(formula, data=df, listw=df.cont.listw, type="mixed") # Spatial Durbin: lag of y plus lagged X
summary(SDM)
SDEM <- errorsarlm(formula, data=df, listw=df.cont.listw, etype="emixed")
summary(SDEM)
SEM <- errorsarlm(formula, data=df, listw=df.cont.listw)
summary(SEM)
SAR <- lagsarlm(formula, data=df, listw=df.cont.listw)
summary(SAR)
SLX <- lmSLX(formula, data=df, listw=df.cont.listw)
summary(SLX)
Again, the SAR model seems to be the best in terms of the significance of the spatial terms and the number of relevant variables. For Cracow, the BIC and AIC results also indicate SAR as the best model.
cat("ols",BIC(ols) ,'\n')
cat("GNS",BIC(GNS),'\n')
cat("SAC",BIC(SAC),'\n')
cat("SDM",BIC(SDM),'\n')
cat("SDEM",BIC(SDEM),'\n')
cat("SAR",BIC(SAR),'\n')
cat("SLX",BIC(SLX),'\n')
cat("SEM",BIC(SEM),'\n')
cat('\n')
cat("ols",AIC(ols) ,'\n')
cat("GNS",AIC(GNS),'\n')
cat("SAC",AIC(SAC),'\n')
cat("SDM",AIC(SDM),'\n')
cat("SDEM",AIC(SDEM),'\n')
cat("SAR",AIC(SAR),'\n')
cat("SLX",AIC(SLX),'\n')
cat("SEM",AIC(SEM),'\n')
Now we looked for a better formula for SAR, and we found one using the general-to-specific procedure.
formula_max = inpost ~ dhl + dpd + fedex + ruch + ups +
log(tot + 1) + fem_ratio +
buildings + shops + parks + forests + schools +
railways + cycleways + parkings + crossings + bus_stops
formula_gen_to_specific = inpost ~ dhl + dpd + ruch + ups +
log(tot + 1) + fem_ratio +
parks + schools +
cycleways + parkings
SAR_g2s <- lagsarlm(formula_gen_to_specific, data=df, listw=df.cont.listw)
summary(SAR_g2s) # the same as previously!
Now we move to the statistical inference part.
Let's start with the logistic-company variables. Both the direct and the indirect impacts are significant for all of them (apart from the total impact of Ruch and the indirect impact of UPS), at least at around the 10% level. The internalization (direct) effects are greater than the spill-over (indirect) effects, but the spill-over effects are still significant. Among the demographic variables, total population density is important both in its direct and in its indirect impact (with a positive sign). The feminization ratio is also significant in its direct impact and has a negative sign, which is rather non-intuitive. Among the point-of-interest data, the numbers of schools, parks, cycleways and parkings are significant, with positive, negative, positive and positive signs respectively, which we find intuitive.
# distribution of total impact
# vector of traces of powers of a spatial weights matrix
# converting censored listw to CsparseMatrix form
W.c <- as(as_dgRMatrix_listw(df.cont.listw), "CsparseMatrix")
# the default values for the number of powers is 30
trMat<-trW(W.c, type="mult")
SAR_imp<-impacts(SAR_g2s, tr=trMat, R=200)
summary(SAR_imp, zstats=TRUE, short=TRUE)
We now combine our statistical inference for Warsaw and Cracow. Based on the above research, we can generally claim that InPost has deployed its parcel pick-up points in line with the competitors: as the number of InPost pick-up points increases, so does the number of competitors' pick-up points (ignoring the competitors that are not significant in the regression). We also demonstrate that our control variables from the demographic and point-of-interest groups are significant, so InPost follows the modeling logic and the literature guidance; for instance, the numbers of cycleways, parkings, schools and crossings are positively related to the number of InPost pick-up points. It also turned out that the significant spatial effect in the multivariate econometric models is the spatial lag; it is quite intuitive that a spill-over effect exists for this problem.
library(spdep)
library(rgdal)
library(maptools)
library(sp)
library(RColorBrewer)
library(classInt)
library(GISTools)
library(maps)
library(spgwr)
library(factoextra)
library(NbClust)
library(repr)
library(stargazer)
options(warn = -1)
In this part of the study we apply the Geographically Weighted Regression (GWR) method, following Mueller et al. (2013). This part of the research also uses clustering methods.
Contrary to the previous models, we do not use data for 1 km2 grids, but only for InPost points; thus, our points on the map are all InPost parcel machines in Warszawa and Kraków. To build the variables for a given point, we create a 500-meter buffer around it, within which we count the operators' points and the points of interest; we also collect the grid data (if more than one grid cell overlapped the buffer, we took the average of the grid values). As the target variable, we chose the number of InPost parcel machines around each InPost point.
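A minimal sketch of this buffer construction (our reading of the procedure; it assumes projected, metric coordinates and uses toy data, whereas the real computation was done on the shapefile layers):

```r
# For each InPost point, count how many points of another layer fall
# within a 500 m radius (Euclidean distance, so coordinates must be
# in a metric projection).
count_within <- function(targets, points, radius = 500) {
  sapply(seq_len(nrow(targets)), function(i) {
    d <- sqrt((points[, 1] - targets[i, 1])^2 + (points[, 2] - targets[i, 2])^2)
    sum(d <= radius)
  })
}

# toy data: 3 lockers on a line and 5 competitor points
lockers <- cbind(x = c(0, 1000, 2000), y = c(0, 0, 0))
dhl_pts <- cbind(x = c(100, 400, 1200, 1900, 2100), y = c(0, 0, 0, 0, 0))
count_within(lockers, dhl_pts)  # counts per locker: 2 1 2
```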
df_points_Warszawa<-readOGR(".", "df_warszawa")
crds<-coordinates(df_points_Warszawa)
names(df_points_Warszawa)[3] <- 'inpost'
names(df_points_Warszawa)[4] <- 'poczta'
names(df_points_Warszawa)[5] <- 'dhl'
names(df_points_Warszawa)[6] <- 'ruch'
names(df_points_Warszawa)[7] <- 'dpd'
names(df_points_Warszawa)[8] <- 'ups'
names(df_points_Warszawa)[22] <- 'fem_ratio'
names(df_points_Warszawa)[36] <- 'tot_log'
names(df_points_Warszawa)[44] <- 'male_65_log'
df_Warszawa <- as.data.frame(df_points_Warszawa)
df_Warszawa$inpost <- as.numeric(df_Warszawa$inpost)
options(repr.plot.width=4, repr.plot.height=4)
ggplot() + geom_point(data=df_Warszawa, aes(long, lat), size = 2.5, alpha = 0.7)
We do not include variables with too high a correlation. Following the non-spatial exploratory analysis, we remove most of the demographic variables.
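A minimal sketch of such a correlation screen (the 0.8 cut-off and the toy data are our assumptions for illustration; in the study the screen was applied to the demographic variables):

```r
# Flag variable pairs whose absolute correlation exceeds a threshold.
set.seed(42)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.1)  # nearly collinear with a
toy$c <- rnorm(100)                    # independent

cormat <- abs(cor(toy))
diag(cormat) <- 0                      # ignore self-correlations
high <- which(cormat > 0.8, arr.ind = TRUE)
unique(rownames(high))                 # variables involved in a high-correlation pair
```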
# equation
eq <- inpost ~ poczta + dhl + dpd + ruch + ups +
buildings + parks + forests + schools + railways + cycleways + parkings + crossings +
tot_log + male_65_log + fem_ratio
The bandwidth search below is commented out due to its long execution time.
# optimum bandwidth
# bw <- ggwr.sel(eq, data=df_Warszawa, coords=crds, family=poisson(), longlat=TRUE)
bw_Warszawa = 2.20067014501355
# GWR model
model.ggwr_Warszawa<-ggwr(eq, data=df_Warszawa, coords=crds, family=poisson(), longlat=TRUE, bandwidth=bw_Warszawa)
model.ggwr_Warszawa
features <- c('poczta', 'dhl', 'dpd', 'ruch', 'ups', 'buildings', 'parks',
'forests', 'schools', 'railways', 'cycleways', 'parkings', 'crossings',
'tot_log', 'male_65_log', 'fem_ratio')
options(repr.plot.width=5, repr.plot.height=3)
layout(matrix(1:16, 1, 16))
for (i in features){
g <- ggplot() +
geom_point(data = df_Warszawa, aes(long, lat, colour = model.ggwr_Warszawa$SDF[[i]]), size = 3, alpha = 0.7) +
labs(title = paste("GWR for:", i, "variable")) +
guides(colour = guide_legend(title = "GWR coeff"))
print(g)
}
Taking into account the above visualizations, note that for most variables the spatial differences between areas of Warsaw are imperceptible. In the case of the operator coefficients, those for Poczta Polska, DHL, DPD and UPS are mostly negative or near zero, while those for Ruch are mostly positive. Regarding the points of interest, most of the coefficients hover around zero. Among the demographic variables, the coefficients for total population are mostly positive, while those for the population of men over 65 are mostly negative.
# clustering of GWR coefficients
fviz_nbclust(as.data.frame(model.ggwr_Warszawa$SDF[,2:18]), FUNcluster=kmeans)
Although the figure above indicates an optimal number of 2 clusters, we decided to choose 5 clusters, accepting a slightly lower average silhouette in exchange for a more interesting analysis than with only 2 clusters.
clusters1<-kmeans(as.data.frame(model.ggwr_Warszawa$SDF[,2:18]), 5)
options(repr.plot.width=5, repr.plot.height=3)
ggplot(data=df_Warszawa, aes(long, lat)) + geom_point(aes(colour = as.factor(clusters1$cluster)), size = 2.5, alpha = 0.7) +
labs(title = '5 clusters, results from kmeans()')
In the figure above, we can see a clear division into 2 central clusters (second and fourth) and 3 smaller clusters on the outskirts of Warsaw (first, third and fifth cluster).
# cluster membership dummies
for (k in 1:5) df_Warszawa[[paste0("clust", k)]] <- as.integer(clusters1$cluster == k)
eq1 <- inpost ~ poczta + dhl + dpd + ruch + ups +
buildings + parks + forests + schools + railways + cycleways + parkings + crossings +
tot_log + male_65_log + fem_ratio + clust1 + clust2 + clust3 + clust4
model.ols_Warszawa<-lm(eq1, data=df_Warszawa)
summary(model.ols_Warszawa)
Due to the large number of insignificant variables, we eliminated them (except for the cluster dummies).
eq2 <- inpost ~ ruch + ups +
parks + cycleways +
tot_log + clust1 + clust2 + clust3 + clust4
model.ols_Warszawa2<-lm(eq2, data=df_Warszawa)
summary(model.ols_Warszawa2)
The adjusted R2 for the "Warszawa model" equals 17.5%, a relatively low value; the reason is probably the irregular location of InPost parcel machines in Warszawa. Importantly, all variables in this model are jointly significant. Analyzing the coefficients, note that all coefficients for the cluster dummies are insignificant, so in this case the division into clusters is not needed (the phenomenon is homogeneous across Warsaw). As for the positive significant coefficients: the more Ruch and UPS points and the higher the total population, the more InPost points. As for the negative significant variables: more parks and cycleways mean fewer InPost points.
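For reference, the joint significance we refer to is the overall F-test reported by summary(); a minimal illustration on built-in data (model.ols_Warszawa2 is read analogously):

```r
# Overall F-test of an lm fit: are all slopes jointly zero?
fit <- lm(mpg ~ wt + hp, data = mtcars)
fs  <- summary(fit)$fstatistic           # value, numdf, dendf
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # joint p-value
```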
df_points_Krakow<-readOGR(".", "df_krakow")
crds<-coordinates(df_points_Krakow)
names(df_points_Krakow)[3] <- 'inpost'
names(df_points_Krakow)[4] <- 'poczta'
names(df_points_Krakow)[5] <- 'dhl'
names(df_points_Krakow)[6] <- 'ruch'
names(df_points_Krakow)[7] <- 'dpd'
names(df_points_Krakow)[8] <- 'ups'
names(df_points_Krakow)[22] <- 'fem_ratio'
names(df_points_Krakow)[36] <- 'tot_log'
names(df_points_Krakow)[44] <- 'male_65_log'
df_Krakow <- as.data.frame(df_points_Krakow)
df_Krakow$inpost <- as.numeric(df_Krakow$inpost)
options(repr.plot.width=4, repr.plot.height=4)
ggplot() + geom_point(data=df_Krakow, aes(long, lat), size = 2.5, alpha = 0.7)
We use the same set of variables as for Warszawa.
# equation
eq <- inpost ~ poczta + dhl + dpd + ruch + ups +
buildings + parks + forests + schools + railways + cycleways + parkings + crossings +
tot_log + male_65_log + fem_ratio
The bandwidth search below is commented out due to its long execution time.
# # optimum bandwidth
# bw_Krakow <- ggwr.sel(eq, data=df_Krakow, coords=crds, family=poisson(), longlat=TRUE)
bw_Krakow = 1.41653116071638
model.ggwr_Krakow<-ggwr(eq, data=df_Krakow, coords=crds, family=poisson(), longlat=TRUE, bandwidth=bw_Krakow)
model.ggwr_Krakow
features <- c('poczta', 'dhl', 'dpd', 'ruch', 'ups', 'buildings', 'parks',
'forests', 'schools', 'railways', 'cycleways', 'parkings', 'crossings',
'tot_log', 'male_65_log', 'fem_ratio')
options(repr.plot.width=5, repr.plot.height=3)
layout(matrix(1:16, 1, 16))
for (i in features){
g <- ggplot() +
geom_point(data = df_Krakow, aes(long, lat, colour = model.ggwr_Krakow$SDF[[i]]), size = 3, alpha = 0.7) +
labs(title = paste("GWR for:", i, "variable")) +
guides(colour = guide_legend(title = "GWR coeff"))
print(g)
}
Considering the above visualizations for Kraków, we see greater spatial variability between areas than in Warszawa. For the operator coefficients, most are around zero, though there are areas with positive or negative values. Regarding the points of interest, the coefficients for parks and forests are mostly negative, while for the remaining points of interest they are around zero. Among the demographic variables, as in Warsaw, the coefficients for total population are mostly positive (though with some "negative" areas), while the coefficient for the population of men over 65 is positive at relatively many points.
# clustering of GWR coefficients
fviz_nbclust(as.data.frame(model.ggwr_Krakow$SDF[,2:18]), FUNcluster=kmeans)
Although the figure above indicates an optimal number of 2 clusters, we decided to choose 4 clusters, accepting a slightly lower average silhouette in exchange for a more interesting analysis than with only 2 clusters.
clusters1<-kmeans(as.data.frame(model.ggwr_Krakow$SDF[,2:18]), 4)
options(repr.plot.width=5, repr.plot.height=3)
ggplot(data=df_Krakow, aes(long, lat)) + geom_point(aes(colour = as.factor(clusters1$cluster)), size = 2.5, alpha = 0.7) +
labs(title = '4 clusters, results from kmeans()')
In the figure above, we can see a division into 2 main vertical clusters (second and third) and 2 smaller clusters (first and fourth cluster) located on the outskirts of Krakow.
# cluster membership dummies
for (k in 1:4) df_Krakow[[paste0("clust", k)]] <- as.integer(clusters1$cluster == k)
eq1 <- inpost ~ poczta + dhl + dpd + ruch + ups +
buildings + parks + forests + schools + railways + cycleways + parkings + crossings +
tot_log + male_65_log + fem_ratio + clust1 + clust2 + clust3
model.ols_Krakow<-lm(eq1, data=df_Krakow)
summary(model.ols_Krakow)
Due to the quite large number of insignificant variables, we eliminated them (except for the cluster dummies).
eq2<-inpost ~ ruch + ups +
parks + forests + schools + cycleways + parkings + crossings +
tot_log + clust1 + clust2 + clust3
# a-spatial linear model
model.ols_Krakow2<-lm(eq2, data=df_Krakow)
summary(model.ols_Krakow2)
The adjusted R2 for the "Kraków model" equals 41%. Compared with the model for Warszawa, the distribution of InPost machines in Kraków is definitely easier to explain. All variables are again jointly significant. In contrast to Warszawa, the cluster division for Kraków is clearly significant: the number of InPost points is higher in cluster 1 (observations in the center of Kraków) than in cluster 4 (observations on the outskirts of Kraków), while it is lower in cluster 2 (points in the north-west of Kraków) than in cluster 4. The estimate for cluster 3 (center of Kraków) is insignificant (no difference between the third and fourth clusters). We can thus clearly see the heterogeneity of this phenomenon in Kraków. Regarding the other variables, considerably more of them are significant than in the case of Warszawa. A greater number of Ruch and UPS points indicates a greater number of InPost points. Among the points of interest, larger numbers of cycleways, parkings and crossings increase the number of InPost points, while larger numbers of green areas (parks and forests) and schools reduce it. Finally, the only significant demographic variable is total population: as expected, the more people, the larger the dependent variable.
stargazer(model.ols_Warszawa, model.ols_Warszawa2, model.ols_Krakow, model.ols_Krakow2, type = 'text')
Comparing the results obtained for Warszawa and Kraków, the adjusted R2 is clearly higher for Kraków than for Warszawa. Note also that InPost originates from Kraków, which may explain the more regular and more logical arrangement of machines there. The division into clusters is definitely more useful in the model for Kraków. What is really interesting, considering the significant variables, is that most of them have the same impact in both cities: a positive impact of the number of Ruch and UPS points on the number of InPost points, a positive impact of total population, and, in turn, a negative impact of parks (green areas).
Answering the hypotheses, we state that all of them are confirmed. InPost has its parcel machines arranged in accordance with the competition (primarily Ruch and UPS). The control variables have a significant impact on the number of InPost parcel machines. The use of GWR allowed us to verify the above hypotheses and to conclude that spatial drift turned out to be significant.
Summarizing all the conclusions, we clearly state that InPost parcel points are deployed in accordance with the points of the other operators. Considering the control variables, their influence on the number of InPost points is mostly significant. We also confirmed the significance of spatial effects in the multivariate econometric models.